The \eslmod{sqio} module contains routines for input from unaligned
sequence data files, such as FASTA files.

Several different common sequence file formats are understood, and can
be automatically recognized.

Sequences can be read sequentially from multiple sequence alignment
files, as if the MSA file was an unaligned sequence file, when the
module is augmented with the \eslmod{msa} module.

Sequences can be read from normal files, directly from the
\ccode{stdin} pipe, or from \ccode{gzip}-compressed files.

Sequences can be read from NCBI BLAST database files.

Sequence files can be automatically looked for in a list of one or
more database directories, specified by an environment variable (such
as \ccode{BLASTDB}).

Table~\ref{tbl:sqio_api} lists the functions in the \eslmod{sqio} API.
The module uses an \ccode{ESL\_SQFILE} object which works much like an
ANSI C \ccode{FILE}, maintaining information for an open sequence file
while it's being read.

% Table generated by autodoc -t esl_sqio.c (so don't edit here, edit esl_sqio.c:)
\begin{table}[hbp]
\begin{center}
{\small
\begin{tabular}{|ll|}\hline
\apisubhead{The \ccode{ESL\_SQFILE} object.}\\
\hyperlink{func:esl_sqfile_Open()}{\ccode{esl\_sqfile\_Open()}} & Description.\\
\hyperlink{func:esl_sqfile_Close()}{\ccode{esl\_sqfile\_Close()}} & Description.\\
\apisubhead{Sequence input/output}\\
\hyperlink{func:esl_sqio_Read()}{\ccode{esl\_sqio\_Read()}} & Description.\\
\hyperlink{func:esl_sqio_Write()}{\ccode{esl\_sqio\_Write()}} & Description.\\
\hyperlink{func:esl_sqio_Echo()}{\ccode{esl\_sqio\_Echo()}} & Echo the next sequence record onto output stream\\
\hyperlink{func:esl_sqio_GuessFileFormat()}{\ccode{esl\_sqioGuessFileFormat()}} & Description.\\
\hyperlink{func:esl_sqio_EncodeFormat()}{\ccode{esl\_sqio\_EncodeFormat()}} & Description.\\
\hyperlink{func:esl_sqio_DecodeFormat()}{\ccode{esl\_sqio\_DecodeFormat()}} & Returns descriptive string for file format code.\\
\hyperlink{func:esl_sqio_IsAlignment()}{\ccode{esl\_sqio\_IsAlignment()}} & Description.\\
\hyperlink{func:esl_sqio_PositionByKey()}{\ccode{esl\_sqio\_PositionByKey()}} & Description.\\
%\hyperlink{func:esl_sqio_Rewind()}{\ccode{esl\_sqio\_Rewind()}} & Description.\\ 
\hyperlink{func:esl_sqfile_GuessAlphabet()}{\ccode{esl\_sqfile\_GuessAlphabet()}} & Guess the alphabet of an open \ccode{ESL\_SQFILE}\\
\apisubhead{Fast random access in a seqfile  [with SSI augmentation]}\\
\hyperlink{func:esl_sqfile_OpenSSI()}{\ccode{esl\_sqfile\_OpenSSI()}} & Opens an SSI index associated with a seq file.\\
\hyperlink{func:esl_sqfile_PositionByKey()}{\ccode{esl\_sqfile\_PositionByKey()}} & Use SSI to reposition seq file to a particular sequence.\\
\hyperlink{func:esl_sqfile_PositionByNumber()}{\ccode{esl\_sqfile\_PositionByNumber()}} & Use SSI to reposition by sequence number\\
\hline
\end{tabular}
}
\end{center}
\caption{The \eslmod{sqio} API.}
\label{tbl:sqio_api}
\end{table}

\subsection{Example: reading sequences from a file}

Figure~\ref{fig:sqio_example} shows a program that opens a file, reads
sequences from it one at a time, then closes the file.

\begin{figure}
\input{cexcerpts/sqio_example}
\caption{Example of reading sequences from a file.}
\label{fig:sqio_example}
\end{figure}

A FASTA file named \ccode{seqfile} is opened for reading by calling
\ccode{esl\_sqfile\_Open(filename, format, env, \&sqfp)}, which
creates a new \ccode{ESL\_SQFILE} and returns it through the
\ccode{sqfp} pointer. If the \ccode{format} is passed as
\ccode{eslSQFILE\_UNKNOWN}, then the format of the file is
autodetected; here, we bypass autodetection by asserting that the file
is in FASTA format by passing a \ccode{eslSQFILE\_FASTA} code. (See
below for a list of valid codes and formats.) The optional \ccode{env}
argument is described below too; here, we're passing \ccode{NULL} and
not using it.

Several things can go wrong in trying to open a sequence file that are
beyond the control of Easel or your application, so it's important
that you check the return code.  \ccode{esl\_sqfile\_Open()} returns
\ccode{eslENOTFOUND} if the file can't be opened; \ccode{eslEFORMAT}
if the file is empty, or if autodetection can't determine its format;
and \ccode{eslEINVAL} if you try to autodetect format on an input
stream that can't be autodetected (a nonrewindable stream: see below
for info about reading from \ccode{stdin} and compressed
files). (Additionally, an internal error might be thrown, which you
should check for if you installed a nonfatal error handler).

The file is then read one sequence at a time by calling
\ccode{esl\_sq\_Read(sqfp, sq)}. This function returns \ccode{eslOK}
if it read a new sequence, and leaves that sequence in the \ccode{sq}
object that the caller provided.  When there is no more data in the
file, \ccode{esl\_sq\_Read()} returns \ccode{eslEOF}. 

If at any point the file does not appear to be in the proper format,
\ccode{esl\_sq\_Read()} returns \ccode{eslEFORMAT}. The application
must check for this. The API provides a little information about what
went wrong and where. \ccode{sqfp->filename} is the name of the file
that we were parsing (not necessarily the same as \ccode{seqfile};
\ccode{sqfp->filename} can be a full pathname if we used an
\ccode{env} argument to look for \ccode{seqfile} in installed database
directories). The function \ccode{esl\_sqfile\_GetErrorBuf} should be
called to get a pointer to the generated error message.  The buffer
 is a brief explanatory message that gets filled in when a 
\ccode{eslEFORMAT} error occurs.
  \footnote{Unlike in the MSA module, you don't get access to the
  current line text; some of sqio's parsers use fast block-based
  (\ccode{fread()}) input instead of line-based input.}

We can reuse the same \ccode{ESL\_SQ} object for subsequent sequences
by calling \ccode{esl\_sq\_Reuse()} on it when we're done with the
previous sequence. If we wanted to load a set of sequences, we'd
\ccode{\_Create()} an array of \ccode{ESL\_SQ} objects. 

Finally, to clean up properly, a \ccode{ESL\_SQ} that was created is
destroyed with \ccode{esl\_sq\_Destroy(sq)}, and a \ccode{ESL\_SQFILE}
is closed with \ccode{esl\_sqfile\_Close()}.

\subsection{Accepted formats}

Accepted unaligned sequence file formats (and their Easel format
codes) are:

\begin{tabular}{ll}
\ccode{eslSQFILE\_DDBJ}     & DDBJ flat text DNA database format \\
\ccode{eslSQFILE\_EMBL}     & EMBL flat text DNA database format \\
\ccode{eslSQFILE\_FASTA}    & FASTA format \\
\ccode{eslSQFILE\_GENBANK}  & GenBank flat text DNA database format \\
\ccode{eslSQFILE\_UNIPROT}  & UniProt flat text protein database format \\
\ccode{eslSQFILE\_NCBI}     & NCBI BLAST database format version 4  \\
\end{tabular}

Additionally, the code \ccode{eslSQFILE\_UNKNOWN} is recognized. It
tells \ccode{esl\_sqfile\_Open()} to perform format autodetection.

\subsection{Special input streams: stdin and compressed files}

There are two special cases for input files. 

The module can read sequence input from a stdin pipe. If the
\ccode{seqfile} argument is ``-'', \ccode{esl\_sqfile\_Open()} ``opens''
standard input (really, it just associates \ccode{stdin}, which is
always open, with the \ccode{ESL\_SQFILE}). 

The module can read compressed sequence files. If the \ccode{seqfile}
argument to \ccode{esl\_sqfile\_Open()} ends in \ccode{.gz}, the file is
assumed to be compressed with \ccode{gzip}; instead of opening it
normally, \ccode{esl\_sqfile\_Open()} opens it as a pipe from
\ccode{gzip -dc}. Your system must support pipes to use this -
specifically, it must support the \ccode{popen()} system call (POSIX.2
compliant operating systems do). The \ccode{configure} script
automatically checks this at compile-time and defines
\ccode{HAVE\_POPEN} appropriately. Obviously, the user must also have
\ccode{gzip} installed and in his PATH.

For both special cases, the catch is that you can't use format
autodetection; you must provide a valid known format code when you
read from stdin or from a compressed file. Pipes are not rewindable,
and format autodetection relies on a two-pass algorithm: it reads
partway into the file to determine the format, then rewinds to start
parsing for real.

\subsection{Augmentations}

The sqio module is optionally augmented by up to two additional
modules, as follows:

\subsubsection{msa: read unaligned sequences sequentially from an alignment}

If sqio is augmented with the msa module, then the sqio API gains the
ability to read alignment file formats in addition to unaligned file
formats. The sqio API remains exactly the same (the caller doesn't
have to use any msa module functions).

\subsubsection{alphabet: digitized sequences}

At present, only placeholders exist in the code for this augmentation.
The plan is to provide the ability to input sequences directly into
\ccode{dsq} as pre-digitized sequences.


\subsubsection{ncbi: read NCBI BLAST databases}

If sqio is augmented with the ncbi module, then the sqio API gains the
ability to read NCBI BLAST database formats in addition to ASCII file
formats. The sqio API remains exactly the same (the caller doesn't
have to use any msa module functions).

To open a BLAST database, the format must be supplied.  If a format
of \ccode{eslSQFILE\_UNKNOWN} is specified, only ASCII based format,
i.e. FASTA, GENBANK, etc. will be opened.

When opening a BLAST database, the file name should not include any
extensions.  If the \ccode{ESL\_ALPHABET} is not specified, Easel
will first try to open a protein database, followed by a DNA database
and finally a multi-volume database.  The handling of a multi-volume
database is done through the alias file.  The only directive in
the alias file is the DBLIST line, listing all the database volumes.




\subsection{Adding a sequence parser}

The \eslmod{sqio} module is designed to handle multiple sequence
formats.  Parsers for different formats can be plugged into the
\eslmod{sqio} module without any API changes.  This means that
programs using the Easel library won't need code changes to
use the new parser.

A new parser will need a format type, a structure for parser
specific data, API functions and a hook into the 
\ccode{sqfile\_open} function.

The list of formats are defined in \ccode{esl\_sqio.h}.  A new
\ccode{\#define} will be added to the existing formats:

\input{cexcerpts/sq_sqio_format}

A data structure for parser specific data will need to be added
to \ccode{ESL\_SQDATA}.  This structure is a union of the 
different parser specific data structures.  If possible the
addition of the parser structure should be augmented so its
inclusion can be controlled.

\input{cexcerpts/sq_sqio_data}

Finally, the routine specific function pointers neeed to be
defined.  The functions in \ccode{esl\_sqio.c} in turn call
these function pointers.  The \ccode{esl\_sqfile\_Open} function
initializes the function pionters to NULL, so if they are not
set an exception will occur when the function is called.  At
a minimum, the function should be defined to return an
\ccode{eslEUNIMPLEMENTED}.  Below is a map the the function
pointers to their respective function.

\begin{tabular}{ll} \\
Function Pointer & ESL\_SQIO Function \\ \hline
position & esl\_sqfile\_Position \\
close & esl\_sqfile\_Close \\
set\_digital & esl\_sqfile\_SetDigital \\
guess\_alphabet & esl\_sqfile\_GuessAlphabet \\
read & esl\_sqio\_Read  \\
read\_info & esl\_sqio\_ReadInfo   \\
read\_seq & esl\_sqio\_ReadSequence \\
read\_window & esl\_sqio\_ReadWindow \\
echo & esl\_sqio\_Echo  \\
read\_block & esl\_sqio\_ReadBlock  \\
open\_ssi & esl\_sqfile\_OpenSSI     \\
pos\_by\_key & esl\_sqfile\_PositionByKey    \\
pos\_by\_number & esl\_sqfile\_PositionByNumber \\
fetch & esl\_sqio\_Fetch       \\
fetch\_info & esl\_sqio\_FetchInfo   \\
fetch\_subseq & esl\_sqio\_FetchSubseq \\
is\_rewindable & esl\_sqfile\_IsRewindable  \\
get\_error & esl\_sqfile\_GetErrorBuf \\
\end{tabular}

\bigskip
A hook needs to be added to the function \ccode{sqfile\_open}.
This hook will try to open the specified file.  If successful,
the \ccode{ESL\_SQFILE} structure should be filled in with function
pointers and the parser specific data and the open hook
return \ccode{eslOK}.  If the sequence files were not found for
the specific parser, an \ccode{eslENOTFOUND} is returned and
the next parser tries to open the file.  Below is an example
of code that tries to open an NCBI BLAST database if not
successful, then the ASCII sequence parsers try to open the
file.

\begin{cchunk}
#ifdef eslAUGMENT\_NCBI
    if (format == eslSQFILE\_NCBI && status == eslENOTFOUND)
      status = esl\_sqncbi\_Open(sqfp->filename, sqfp->format, sqfp);
#endif
    if (status == eslENOTFOUND)
      status = esl\_sqascii\_Open(sqfp->filename, sqfp->format, sqfp);
\end{cchunk}
